Conversation
|
benchmark results : It seems like we are 25x faster for u16 bitmap based accumulators (or I am sleepy :) ) |
|
I think we can do the same for 16 bit types, it is just 65_536 bytes 8192 if we use a bitmap. |
|
Oh wait, you're already doing that :) |
|
Query 0 in clickbench_extended dataset (which uses (Other queries are faster but I believe that is more around variance ) |
|
cc : @neilconway , @alamb , @martin-g . Please take a look whenever you get a chance |
This comment has been minimized.
This comment has been minimized.
alamb
left a comment
There was a problem hiding this comment.
This looks like a great idea. Thank you @coderfender
| harness = false | ||
|
|
||
| [[bench]] | ||
| name = "count_distinct" |
There was a problem hiding this comment.
can you please add this benchmark as a separate PR (so we can use our standard benchmark runner to confirm the results)?
There was a problem hiding this comment.
Thats sounds like a good idea . Let me put up a PR to add benchmarks to count_distinct , approx_distinct on a separate PR
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
| /// Uses 256 bytes to track all possible u8 values. | ||
| #[derive(Debug)] | ||
| pub struct BoolArray256DistinctCountAccumulator { | ||
| seen: Box<[bool; 256]>, |
There was a problem hiding this comment.
I think you can probably use a BooleanBuffer from Arrow to make this signifcantly faster (I think [bool uses a byte for each booelan) 🤔
There was a problem hiding this comment.
Great idea ! Let me try and experiment with that
There was a problem hiding this comment.
For small arrays this might not be necessarily faster (as it fits in L1 cache) - perhaps only if you can use instructions like popcount etc.?
There was a problem hiding this comment.
Although it probably wouldn't be very different anyway.
There was a problem hiding this comment.
It seems like BooleanBuffer is immutable. I didnt have luck using [u8:8] while trying to optimize approx_distinct :
#21453 (comment) and resorted the approach which yielded good speed up (~2x) vs regression using[u8;8]
There was a problem hiding this comment.
Results using MutableBuffer : (slower than directly using bitmap)
group bitmap_count_distinct bitmap_count_distinct_mutable_buffer main
----- --------------------- ------------------------------------ ---- 6.2±0.22µs ? ?/sec
count_distinct i16 bitmap 1.03 3.1±0.14µs ? ?/sec 1.00 3.0±0.02µs ? ?/sec 26.16 77.9±3.43µs ? ?/sec
count_distinct i64 80% distinct 1.03 48.2±0.44µs ? ?/sec 1.00 47.0±0.40µs ? ?/sec 1.02 48.0±1.49µs ? ?/sec
count_distinct i64 99% distinct 1.02 48.3±0.46µs ? ?/sec 1.00 47.2±0.92µs ? ?/sec 1.02 48.3±2.14µs ? ?/sec
count_distinct i8 bitmap 1.00 2.9±0.43µs ? ?/sec 1.37 3.9±0.06µs ? ?/sec 5.75 16.4±0.26µs ? ?/sec
count_distinct u16 bitmap 1.06 3.2±0.30µs ? ?/sec 1.00 3.0±0.16µs ? ?/sec 25.58 77.7±1.82µs ? ?/sec
count_distinct u8 bitmap 1.00 2.1±0.03µs ? ?/sec 1.90 3.9±0.08µs ? ?/sec 8.14
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
93acd98 to
7e67e2e
Compare
|
run benchmark count_distinct |
|
run benchmark clickbench |
|
🤖 Criterion benchmark running (GKE) | trigger CPU Details (lscpu)Comparing optimize_count_distinct (a13bcaa) to fbdf770 (merge-base) diff File an issue against this benchmark runner |
|
Benchmark for this request failed. Last 20 lines of output: Click to expandFile an issue against this benchmark runner |
|
🤖 Criterion benchmark running (GKE) | trigger CPU Details (lscpu)Comparing optimize_count_distinct (a13bcaa) to fbdf770 (merge-base) diff File an issue against this benchmark runner |
|
Benchmark for this request failed. Last 20 lines of output: Click to expandFile an issue against this benchmark runner |
a13bcaa to
f8c01a1
Compare
|
Rebased with main |
3db92e3 to
df90ef5
Compare
|
Update benchmarks after rebase with main Command : |
Which issue does this PR close?
Remove hashset based accumulators for smaller int data types and use bitmaps. Follow up of : #21453
Rationale for this change
What changes are included in this PR?
Are these changes tested?
Are there any user-facing changes?